Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following object is masked from 'package:purrr':
transpose
The following objects are masked from 'package:dplyr':
between, first, last
Date Source Site.ID POC
Length:15976 Length:15976 Min. :60010007 Min. :1.000
Class :character Class :character 1st Qu.:60290014 1st Qu.:1.000
Mode :character Mode :character Median :60590007 Median :1.000
Mean :60549600 Mean :1.581
3rd Qu.:60731002 3rd Qu.:1.000
Max. :61131003 Max. :6.000
Daily.Mean.PM2.5.Concentration Units Daily.AQI.Value
Min. : 0.00 Length:15976 Min. : 0.00
1st Qu.: 7.00 Class :character 1st Qu.: 39.00
Median : 12.00 Mode :character Median : 56.00
Mean : 16.12 Mean : 59.28
3rd Qu.: 20.50 3rd Qu.: 72.00
Max. :104.30 Max. :185.00
Local.Site.Name Daily.Obs.Count Percent.Complete AQS.Parameter.Code
Length:15976 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88215
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
Max. :1 Max. :100 Max. :88502
AQS.Parameter.Description Method.Code Method.Description CBSA.Code
Length:15976 Min. :117 Length:15976 Min. :12540
Class :character 1st Qu.:120 Class :character 1st Qu.:23420
Mode :character Median :120 Mode :character Median :40140
Mean :297 Mean :33270
3rd Qu.:707 3rd Qu.:41740
Max. :810 Max. :49700
NA's :929
CBSA.Name State.FIPS.Code State County.FIPS.Code
Length:15976 Min. :6 Length:15976 Min. : 1.00
Class :character 1st Qu.:6 Class :character 1st Qu.: 29.00
Mode :character Median :6 Mode :character Median : 59.00
Mean :6 Mean : 54.78
3rd Qu.:6 3rd Qu.: 73.00
Max. :6 Max. :113.00
County Site.Latitude Site.Longitude
Length:15976 Min. :32.63 Min. :-124.2
Class :character 1st Qu.:34.07 1st Qu.:-121.4
Mode :character Median :35.36 Median :-119.1
Mean :36.00 Mean :-119.4
3rd Qu.:37.77 3rd Qu.:-117.9
Max. :41.71 Max. :-115.5
Date Source Site.ID POC
Length:59756 Length:59756 Min. :60010007 Min. : 1.00
Class :character Class :character 1st Qu.:60290019 1st Qu.: 1.00
Mode :character Mode :character Median :60631006 Median : 3.00
Mean :60563315 Mean : 3.77
3rd Qu.:60731026 3rd Qu.: 3.00
Max. :61131003 Max. :24.00
Daily.Mean.PM2.5.Concentration Units Daily.AQI.Value
Min. : -6.700 Length:59756 Min. : 0.00
1st Qu.: 4.100 Class :character 1st Qu.: 23.00
Median : 6.800 Mode :character Median : 38.00
Mean : 8.429 Mean : 39.28
3rd Qu.: 10.700 3rd Qu.: 54.00
Max. :302.500 Max. :454.00
Local.Site.Name Daily.Obs.Count Percent.Complete AQS.Parameter.Code
Length:59756 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88192
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88101
Max. :1 Max. :100 Max. :88502
AQS.Parameter.Description Method.Code Method.Description CBSA.Code
Length:59756 Min. :143 Length:59756 Min. :12540
Class :character 1st Qu.:170 Class :character 1st Qu.:31080
Mode :character Median :170 Mode :character Median :40140
Mean :336 Mean :34957
3rd Qu.:707 3rd Qu.:41860
Max. :810 Max. :49700
NA's :4567
CBSA.Name State.FIPS.Code State County.FIPS.Code
Length:59756 Min. :6 Length:59756 Min. : 1.00
Class :character 1st Qu.:6 Class :character 1st Qu.: 29.00
Mode :character Median :6 Mode :character Median : 63.00
Mean :6 Mean : 56.19
3rd Qu.:6 3rd Qu.: 73.00
Max. :6 Max. :113.00
County Site.Latitude Site.Longitude
Length:59756 Min. :32.58 Min. :-124.2
Class :character 1st Qu.:34.07 1st Qu.:-121.4
Mode :character Median :36.49 Median :-119.6
Mean :36.24 Mean :-119.6
3rd Qu.:37.96 3rd Qu.:-117.9
Max. :41.76 Max. :-115.5
Daily PM2.5
summary(epa02$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(epa22$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.700 4.100 6.800 8.429 10.700 302.500
sum(is.na(epa02$Daily.Mean.PM2.5.Concentration))
[1] 0
sum(is.na(epa22$Daily.Mean.PM2.5.Concentration))
[1] 0
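There are 22 variables in each year's EPA dataset, with 15,976 observations in 2002 and 59,756 in 2022. The key variable, daily mean PM2.5 concentration, has no missing values in either year (the only missing values in the summaries are in CBSA.Code: 929 in 2002 and 4,567 in 2022). However, the 2022 minimum of -6.7 is negative, which suggests potential issues in the data.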
Step 2
Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
epa_merge <- merge(x = epa02, y = epa22, all = TRUE)
Date is transformed to a date format and the year variable is created in a numeric format.
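The date conversion, year column, and renaming described above can be sketched as follows. This is a minimal base R illustration on made-up rows, not the code used for the report; it assumes the dates arrive as month/day/year strings, and `epa_demo` is a hypothetical stand-in for the combined data.

```r
# Toy stand-in for the combined EPA data (all values are made up)
epa_demo <- data.frame(
  Date = c("01/05/2002", "08/15/2022"),
  Daily.Mean.PM2.5.Concentration = c(12.0, 6.5),
  Daily.AQI.Value = c(52, 27),
  stringsAsFactors = FALSE
)

# Assumption: dates are stored as month/day/year character strings
epa_demo$Date <- as.Date(epa_demo$Date, format = "%m/%d/%Y")

# Extract the calendar year as a numeric identifier
epa_demo$Year <- as.numeric(format(epa_demo$Date, "%Y"))

# Shorten the names of the key variables
names(epa_demo)[names(epa_demo) == "Daily.Mean.PM2.5.Concentration"] <- "dailyPM2.5"
names(epa_demo)[names(epa_demo) == "Daily.AQI.Value"] <- "dailyAQI"

str(epa_demo)
```

If the raw dates use a different layout, only the `format` string in `as.Date()` would need to change.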
Step 3
Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
There are more stations in 2022 than in 2002. Many of the new sites are in densely populated areas such as Los Angeles and San Francisco. Densely populated areas tend to be more polluted, in part because of transportation and reliance on cars. Monitoring PM2.5 concentrations in these areas may be useful for understanding pollution patterns and developing policies that affect a significant portion of Californians.
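The claim about station counts can be checked numerically as well as visually. The sketch below is a hedged base R example on made-up site IDs (the `epa_demo` data frame is hypothetical): it counts distinct sites per year and lists the sites that appear only in 2022.

```r
# Toy data: several daily rows per site, with made-up site IDs
epa_demo <- data.frame(
  Year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  Site.ID = c(1, 1, 2, 1, 2, 3, 3)
)

# Keep one row per site-year pair, then count the sites in each year
site_year <- unique(epa_demo[, c("Year", "Site.ID")])
sites_per_year <- table(site_year$Year)
sites_per_year

# Sites present in 2022 that were not present in 2002
new_sites <- setdiff(site_year$Site.ID[site_year$Year == 2022],
                     site_year$Site.ID[site_year$Year == 2002])
new_sites
```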
Step 4
Check for any missing or implausible values of PM in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
sum(is.na(epa_merge$dailyPM2.5))
[1] 0
setorder(epa_merge, dailyPM2.5)  # sort ascending: lowest PM2.5 values first
epa_merge %>% select(Date, dailyPM2.5, dailyAQI, Local.Site.Name)  # data frame hidden because it's too large
setorder(epa_merge, -dailyPM2.5)  # sort descending: highest PM2.5 values first
epa_merge %>% select(Date, dailyPM2.5, dailyAQI, Local.Site.Name)  # data frame hidden because it's too large
In terms of location, there does not appear to be a pattern for the days with the lowest PM2.5 values. The lowest daily PM2.5 values were in January of both 2002 and 2022. The highest PM2.5 values were recorded between the end of July and mid-September of 2022 (summer 2022). This may be due to wildfire season, which led to a lot of particulate matter and pollution in the air during this time. The highest PM2.5 concentration, 302.5 ug/m^3, was recorded on July 31st, 2022 in Yreka. This aligns with the McKinney Fire, which burned in late July of 2022.
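To put numbers on the implausible values, one option is to flag negative concentrations and compute their proportion by year and their distribution by month. The following is a minimal base R sketch on made-up data (`epa_demo` and its values are hypothetical), assuming that a negative reading is what counts as implausible.

```r
# Toy data with one negative (implausible) reading; all values are made up
epa_demo <- data.frame(
  Date = as.Date(c("2002-01-10", "2002-07-04", "2022-01-15",
                   "2022-01-20", "2022-08-01", "2022-09-02")),
  dailyPM2.5 = c(3.0, 15.2, -1.5, 2.0, 120.4, 60.1)
)
epa_demo$Year  <- as.numeric(format(epa_demo$Date, "%Y"))
epa_demo$Month <- as.numeric(format(epa_demo$Date, "%m"))

# Flag implausible values: a true mass concentration cannot be negative
epa_demo$negative <- epa_demo$dailyPM2.5 < 0

# Proportion of negative readings within each year
prop_negative <- tapply(epa_demo$negative, epa_demo$Year, mean)
prop_negative

# Months in which the negative readings occur (temporal pattern)
negative_months <- table(epa_demo$Month[epa_demo$negative])
negative_months
```

The same pattern works for very high values by swapping the condition, for example flagging readings above a chosen cutoff.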
Step 5
Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
state
library(ggplot2)
epa_merge$Year1 <- as.factor(epa_merge$Year)
epa_merge$Year1 <- relevel(epa_merge$Year1, '2022')
ggplot(epa_merge, aes(x = dailyPM2.5, fill = Year1)) +
  geom_histogram(bins = 100, color = 'black', alpha = 0.5, position = 'identity') +
  labs(title = "Distribution of sites by Daily PM2.5 Concentration in 2002 and 2022", x = "Daily PM2.5 Concentration", y = "Count") +
  xlim(0, 100)
Warning: Removed 254 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_bar()`).
ggplot(epa_merge, aes(x = dailyPM2.5, fill = Year1)) +
  geom_histogram(bins = 100, position = 'dodge') +
  labs(title = "Distribution of sites that reported unhealthy Daily PM2.5 Concentration in 2002 and 2022", x = "Daily PM2.5 Concentration", y = "Count") +
  xlim(35, 310)
Warning: Removed 73652 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_bar()`).
# A tibble: 2 × 7
Year mean median sd min max IQR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2002 16.1 12 13.9 0 104. 13.5
2 2022 8.43 6.8 7.64 -6.7 302. 6.6
There were more measurements reported in California during 2022 than in 2002, since more sites were added within the 20-year period. The histograms for both years show a positive skew in PM2.5 concentration. The average daily PM2.5 concentration, however, was higher in 2002 at 16.12 ug/m^3 LC than in 2022 at 8.43 ug/m^3 LC. The median, standard deviation, and IQR were also higher in 2002. While 2002 had more particulate matter air pollution on average, the highest PM2.5 concentration was reported in July 2022 at 302.5 ug/m^3 LC.
According to the EPA, a 24-hour PM2.5 concentration of 35 ug/m^3 LC and above is considered unhealthy for sensitive groups. The second histogram shows the distribution of PM2.5 concentrations of 35 ug/m^3 LC and above in California. Fewer sites reported unhealthy air quality days in 2022 compared to 2002, indicating that air quality has improved over 20 years.
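Because the two years have very different numbers of observations, the comparison of unhealthy days is easier to read as a proportion than as a raw count. Here is a hedged base R sketch on made-up values (`epa_demo` is hypothetical), using the 35 ug/m^3 cutoff mentioned above.

```r
# Toy data: made-up daily readings for each year
epa_demo <- data.frame(
  Year       = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022, 2022),
  dailyPM2.5 = c(10, 36, 50, 20, 5, 8, 40, 7, 9)
)

threshold <- 35  # ug/m^3 LC, the cutoff used in the text above

# TRUE for readings at or above the threshold
exceeds <- epa_demo$dailyPM2.5 >= threshold

# Raw count and share of exceedances within each year
count_exceed <- tapply(exceeds, epa_demo$Year, sum)
share_exceed <- tapply(exceeds, epa_demo$Year, mean)

count_exceed
share_exceed
```

A share like this counts site-days rather than distinct sites, so it answers a slightly different question from "how many sites ever exceeded the cutoff".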
county
ggplot(epa_merge) +
  geom_point(mapping = aes(x = County, y = dailyPM2.5, colour = factor(Year))) +
  scale_color_manual(values = c("slateblue", "skyblue")) +
  labs(x = "County", y = "Daily PM2.5 Concentration (ug/m^3 LC)") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))
`summarise()` has grouped output by 'County'. You can override using the
`.groups` argument.
epa_County <- epa_merge %>% group_by(County)
ggplot(epa_County, aes(x = factor(Year), y = dailyPM2.5, fill = factor(Year))) +
  geom_boxplot() +
  labs(title = "Box Plot of Daily PM2.5 Concentrations in California Counties (2002 vs 2022)", x = "Year", y = "Daily PM2.5 Concentration (µg/m³)") +
  scale_fill_manual(values = c("skyblue", "lightgreen")) +
  theme_minimal() +
  theme(legend.position = "none")
Overall, mean daily PM2.5 concentrations were lower in 2022 across California counties. There were many cases (outliers) where daily PM2.5 concentrations were higher in 2022. The highest daily PM2.5 concentrations appeared mostly in counties that were heavily influenced by the 2022 wildfires, for example Mariposa, Nevada, Placer, Riverside, Siskiyou, and Trinity counties.
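A county-by-year table of means makes it easier to see which counties improved and which got worse. The sketch below is a base R illustration on made-up data with two hypothetical counties, "A" and "B"; it is not the `summarise()` call that produced the message above.

```r
# Toy data: two hypothetical counties with made-up daily readings
epa_demo <- data.frame(
  County     = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Year       = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  dailyPM2.5 = c(20, 24, 10, 12, 8, 10, 30, 40)
)

# Mean daily PM2.5 for every county-year combination (rows = county, columns = year)
county_means <- tapply(epa_demo$dailyPM2.5,
                       list(epa_demo$County, epa_demo$Year),
                       mean)
county_means

# Change from 2002 to 2022; negative values mean cleaner air in 2022
change <- county_means[, "2022"] - county_means[, "2002"]
change
```

In this toy example county A improves while county B worsens, which is the kind of exception a wildfire-affected county would show.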
SITE IN LA County
epa_MainStreet <- epa_merge %>% filter(Local.Site.Name == "Los Angeles-North Main Street")
ggplot(epa_MainStreet, aes(x = factor(Year), y = dailyPM2.5, fill = factor(Year))) +
  geom_boxplot() +
  labs(title = "Box Plot of Daily PM2.5 Concentrations in Los Angeles - North Main Street (2002 vs 2022)", x = "Year", y = "Daily PM2.5 Concentration (µg/m³)") +
  scale_fill_manual(values = c("skyblue", "lightgreen")) +
  theme_minimal() +
  theme(legend.position = "none")
# A tibble: 2 × 7
Year mean median sd min max IQR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2002 22.0 19.3 11.7 3.9 66.3 13
2 2022 11.6 10.9 4.57 2.4 38 5.98
Looking more closely at the Los Angeles-North Main Street station, daily PM2.5 concentrations were lower in 2022 than in 2002: the mean daily PM2.5 was 22.0 ug/m^3 in 2002 and 11.6 ug/m^3 in 2022. The decrease in the median (19.3 to 10.9) and IQR (13 to 5.98), together with the presence of fewer outliers, suggests that air quality has improved at this site in Los Angeles, possibly due to regulatory measures or changes in environmental conditions.